... a data visualization technique used for representing text data.
It can provide a quick visual insight and lead to more in-depth analyses.
A word cloud is a collection, or cluster, of words depicted in different sizes. The bigger and bolder the word appears, the more often it's mentioned within a text.
Significant textual data points can be highlighted using a word cloud.
The first step is data wrangling, which includes:
- Gathering the data,
- Assessing the data's quality and structure,
- Modifying the data to make it clean.
Import the necessary libraries:
requests (a popular HTTP library for Python)
pdfplumber (helps extract text from PDFs)
pandas (data manipulation and analysis)
Image (displays images in the notebook)
io (manages file-related input and output operations)
matplotlib (a plotting library)
wordcloud (generates the word cloud)
time (measures how long the generation takes)
import requests
import pdfplumber
import pandas as pd
from IPython.display import Image
from IPython.core.display import HTML
from io import BytesIO
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import time
I will programmatically obtain the April 2022 IMF World Economic Outlook report.
The IMF web page for the report (IMF_website, shown as a comment below) contains the dataset, commentary from the IMF chief economist, a number of articles, and the report itself under the Download Full Report button.
Image('C:/Users/IMF/Images/wc_4.PNG', width=400, height=400)
# IMF_website = https://www.imf.org/en/Publications/WEO/Issues/2022/04/19/world-economic-outlook-april-2022
IMF_Apr_2022 = "https://www.imf.org/-/media/Files/Publications/WEO/2022/April/English/text.ashx"
report_2022 = requests.get(IMF_Apr_2022)
IMF_Apr_2022 is the URL of the PDF report.
The report_2022 object is the HTTP response; it holds the report's content along with information about the request.
A quick test shows whether the request above worked.
Printing report_2022 shows the response status code:
If the HTTP request was successful, the standard status code is 200.
Otherwise, a 4xx or 5xx code, such as 404 (Not Found), indicates that it was not successful.
print(report_2022)
<Response [200]>
Other information that report_2022 object holds is the report's URL:
print(report_2022.url)
https://www.imf.org/-/media/Files/Publications/WEO/2022/April/English/text.ashx
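Beyond printing the response, the status code can be interpreted in code. The sketch below uses only the standard library; describe_status is a hypothetical helper, not part of requests. In the notebook, the integer would come from report_2022.status_code, and report_2022.raise_for_status() could be called to raise an exception on a 4xx or 5xx response.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
print(describe_status(503))  # 503 Service Unavailable (server error)
```

The point is that 404 is only one of many failure codes, so checking for "not 200" or calling raise_for_status() is safer than testing for 404 alone.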
The report is in PDF format and it will be inspected visually.
# Save the PDF report locally
open("C:/Users/IMF/report_2022.pdf", 'wb' ).write(report_2022.content)
5790119
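The number printed above is the return value of write(), that is, the count of bytes written to disk. A minimal sketch of the same step with the file closed explicitly is shown here; save_bytes and the demo payload are assumptions for illustration, and a temporary directory stands in for the real download folder.

```python
import tempfile
from pathlib import Path


def save_bytes(content: bytes, target: Path) -> int:
    """Write content to target and return the number of bytes written."""
    with open(target, "wb") as fh:  # the file is closed when the block ends
        return fh.write(content)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pdf"
    # Stand-in payload; in the notebook this would be report_2022.content
    written = save_bytes(b"%PDF-1.7 demo", demo_path)
    size_on_disk = demo_path.stat().st_size

print(written, size_on_disk)
```

Comparing the returned count with the file size on disk is a cheap sanity check that the download was saved in full.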
Image('C:/Users/IMF/Images/IMF_2022.PNG', width=400, height=400)
Not all of the report's 200 pages are needed to create the word cloud. For example, pages with graphs and data tables will be omitted.
Specifically, pages 17-119 contain the main report, so only this subset of the text will be extracted. Note that pdf.pages is zero-indexed, so the slice pdf.pages[17:120] used below covers indices 17 through 119.
Finally, the word cloud generator works on plain text, so the extracted pages are saved to a .txt file.
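pdf = pdfplumber.open(BytesIO(report_2022.content))
with open("C:/Users/IMF/report_2022.txt", 'w', encoding="utf-8") as r_2022:
    for page in pdf.pages[17:120]:
        # A page without extractable text may yield nothing, so fall back to "";
        # the newline keeps words at page boundaries from running together
        r_2022.write((page.extract_text() or "") + "\n")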
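Because Python slices are zero-based and exclude the end index, pdf.pages[17:120] selects the 18th through 120th pages of the file. Whether the stated page range refers to printed page numbers or to positions in the PDF is not clear from the text, so the sketch below only illustrates the arithmetic. page_slice is a hypothetical helper, and a plain list stands in for pdf.pages.

```python
def page_slice(first_page: int, last_page: int) -> slice:
    """Convert 1-based, inclusive page numbers to a 0-based Python slice."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return slice(first_page - 1, last_page)


# Stand-in for pdf.pages: each element is its own 1-based page position
pages = list(range(1, 201))

selected = pages[page_slice(17, 119)]
print(selected[0], selected[-1], len(selected))  # 17 119 103

original = pages[17:120]
print(original[0], original[-1], len(original))  # 18 120 103
```

Both selections contain 103 pages; they differ only by a one-page shift, which is easy to overlook when a PDF's printed numbering does not match its physical page order.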
The stopwords list below contains words that carry little meaningful information, so the word cloud generator will ignore them:
stopwords = ['will', 'and', 'are', 'to', 'with', 'in', 'the', 'of', 'a', 'on', 'by', 'that', 'have', 'as', 'countries', 'than',
             'at', 'it', 'how', 'has', 'percent', 'be', 'for', 'more', 'many', 'but', 'i', 'is', 'from', 'an', 'also']
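To see why stop words matter, it helps to count word frequencies by hand, since frequency is what drives the size of each word in the cloud. The sketch below is an illustration only: word_frequencies, the sample sentence, and the short stop-word list are assumptions, and the wordcloud library performs its own tokenization internally.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords, top_n=2):
    """Count words in text, skipping stop words, and return the top_n."""
    ignored = {w.lower() for w in stopwords}
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in ignored)
    return counts.most_common(top_n)


sample = (
    "Inflation is rising and the war adds to inflation. "
    "The war and the pandemic weigh on growth and inflation."
)
demo_stopwords = ["and", "the", "is", "to", "on"]

print(word_frequencies(sample, demo_stopwords))
# [('inflation', 3), ('war', 2)]
```

Without the filter, "and" and "the" would each tie with "inflation" at three occurrences and crowd the informative words out of the picture.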
start = time.time()
with open("C:/Users/IMF/report_2022.txt", encoding="utf-8") as f:
    text = f.read()
wordcloud = WordCloud(max_font_size=60, width=800, height=400, stopwords=stopwords, max_words=70).generate(text)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
end = time.time()
print(end - start)
0.24799299240112305
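The elapsed time above was measured with time.time(). For short intervals, time.perf_counter() is generally the better choice because it is a monotonic, high-resolution clock. The following is a minimal sketch of a reusable timer; timed and the dummy workload are assumptions for illustration, and in the notebook the word cloud generation would go inside the with block.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    # Dummy workload standing in for the word cloud generation
    total = sum(i * i for i in range(100_000))

print(f"demo took {timings['demo']:.4f} s")
```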
The word cloud shown in the image above conveys the crux of the report at a glance.
It may also raise more questions than it answers, but that is often a good entry point to discussion.
More specifically:
Image('C:/Users/IMF/Images/wc_1.PNG', width=400, height=400)
The analysis and projections in the World Economic Outlook highlight the importance of developments in the first half of 2022 around Russia, Ukraine, and the war, and the word cloud renders these words in a large font.
Image('C:/Users/IMF/Images/wc_2.PNG', width=400, height=400)
Inflation has surged in many economies because of soaring commodity prices and pandemic-induced supply-demand imbalances. In many countries, inflation has become a central concern. The word cloud renders inflation in one of its largest fonts.
Image('C:/Users/IMF/Images/wc_3.PNG', width=400, height=400)
The world was unprepared for the COVID-19 pandemic and remains vulnerable to future COVID-19 variants. Many countries undertook a huge and necessary fiscal expansion during the pandemic, so the word is repeated throughout the report.
In conclusion, a word cloud of the April 2022 IMF World Economic Outlook report is a data visualization that provides a snapshot of the most important themes in the report.